Day 1 Session 2:
Dynamic spectral analysis
using Generalised Additive Mixed-Effect Models (GAMMs)
Introduction
In the previous sesseion, we tried running a statistical analysis using linear mixed-effect modelling. It was a static analysis based on a single point measurement of formant frequencies to characterise differences of the English /l/ and /ɹ/ quality produced by L1 Japanese and L1 English speakers.
While we observed some key between-group differences in the static data, the static analysis did not allow us to answer why such differences would occur. Specifically, it was not very unclear whether it is just the height of the formant frequencies that are different, or the formant frequency difference is just a small part of a broader, more fundamental difference.
More crucially, analysing the liquid consonants based on the static analysis lacks the consideration that English liquid consonants are inherently dynamic. This means that their spectral properties vary as a function of time, and a single-point measurement of formant frequencies is not adequate in characterising the English liquid quality. Also, English liquids show a strong interaction with the neighbouring vowels, so we need to consider how English liquids are coarticulated with the following vowel.
Taken together, we need to take the temporal dimension into account when analysing English liquids. The possible liquid-vowel interaction may mean that L1 Japanese speakers may produce the sequence of liquid consonant and a vowel with a different interaction pattern from that of L1 English speakers.
In this session, therefore, let’s try modelling the whole formant contours directly using Generalised Additive Mixed-Efefct Models (GAMMs).
Generalised Additive Mixed-Effect Models (GAMMs)
Preliminaries
Installing/loading packages
As always, let’s first install and load R packages that we are using in the static analysis section. The installation commands have been commented out, but please uncomment them and install the packages if you do not have them on your machine yet.
# installing packages
# install.packages("tidyverse")
# install.packages("mgcv")
# install.packages("itsadug")
# install.packages("tidymv")
# install.packages("tidygam")
# importing packages
library(tidyverse)
library(mgcv)
library(itsadug)
# library(tidymv) # for 'get_gam_predictions'
library(tidygam) # for 'get_gam_predictions'
source("https://raw.githubusercontent.com/soskuthy/gamm_intro/master/gamm_hacks.r")Check data
Similarly to the static analysis, let’s spend some time inspecting
the data set using colnames().
## [1] "...1" "file" "speaker" "language"
## [5] "f0" "duration" "segment" "previous_sound"
## [9] "next_sound" "word" "f1" "f2"
## [13] "f3" "time" "IsApproximant" "IsAcoustic"
## [17] "note" "gender" "omit" "Barkf1"
## [21] "Barkf2" "Barkf3" "f2f1" "f3f2"
## [25] "Barkf2f1" "Barkf3f2" "position" "context"
## [29] "liquid" "input_file"
The majority of the variables in the data set are quite similar to the static data set. The dynamic data set also contains bark-normalised formant frequencies, but we won’t be using them here.
The crucial difference between the static and the dynamic data sets
is the time column. As I explained earlier, the crucial
aspect in the dynamic analysis is the time dimension, and
we are interested in the time-varying characteristics in formant
trajectories.
Let’s look into the time column in more detail. The
following code displays what information is contained in the
time column.
## # A tibble: 11 × 1
## time
## <dbl>
## 1 0
## 2 10
## 3 20
## 4 30
## 5 40
## 6 50
## 7 60
## 8 70
## 9 80
## 10 90
## 11 100
According to the code above, the time dimension contains
numbers from 0 to 100 with an increment of 10. The data set expresses
the time dimension proportionally from 0% to 100%. This
means that the beginning of an interval corresponds to 0% time point,
and then we get time points 10%, 20%, 30% etc as time goes by, until
100% that corresponds to the end of an interval.
The next question here: what interval are we talking about? In the introduction section, I talked about the interaction between the liquid consonant and the following vowel. For this reason, we will treat the liquid and vowel intervals as one whole entity, in which each interval contains movement of the formant trajectories throughout the word-initial liquid and vowel intervals. This means that 0% time point corresponds to the onset of the word-initial liquid, and 100% time point corresponds to the end point of the following vowel.
Data wrangling
Omitting irrelavent columns
Let’s tidy up the data a little bit to avoid further confusion. As
done with the static anlaysis, we will try to remove columns that are no
longer necessary. The three columns, IsApproximant,
IsAcoustic, and omit can safely be removed, as
well as some spectral measures including Bark-normalised and distance
measures as they can be confusing later on.
# Let's check the number of "approximant" tokens
df_dyn |>
dplyr::group_by(IsApproximant) |>
dplyr::summarise() |>
dplyr::ungroup()## # A tibble: 1 × 1
## IsApproximant
## <chr>
## 1 yes
# Let's check the number of tokens of good recording quality
df_dyn |>
dplyr::group_by(IsAcoustic) |>
dplyr::summarise() |>
dplyr::ungroup()## # A tibble: 1 × 1
## IsAcoustic
## <chr>
## 1 yes
## # A tibble: 1 × 1
## omit
## <dbl>
## 1 0
# Remove columns that we no longer need
df_dyn <- df_dyn |>
dplyr::select(-c(IsApproximant, IsAcoustic, omit, Barkf1, Barkf2, Barkf3, Barkf2f1, Barkf3f2, f2f1, f3f2))Let’s check the column names again.
## [1] "...1" "file" "speaker" "language"
## [5] "f0" "duration" "segment" "previous_sound"
## [9] "next_sound" "word" "f1" "f2"
## [13] "f3" "time" "note" "gender"
## [17] "position" "context" "liquid" "input_file"
Note that we have a column called context: this can be
useful for grouping so let’s keep them. But we can convert them into IPA
symbols for a more intuitive representation:
Checking the number of participants, tokens…
Let’s also obtain some descriptive statistics here. Note that we need to divide the number of rows by 11 to obtain the accurate number of tokens, as one token now has 11 time points.
# number of participants
df_dyn |>
dplyr::group_by(language) |>
dplyr::summarise(n = n_distinct(speaker)) |>
dplyr::ungroup()## # A tibble: 2 × 2
## language n
## <chr> <int>
## 1 English 14
## 2 Japanese 31
# number of tokens per segment
df_dyn |>
dplyr::group_by(segment) |>
dplyr::summarise(n = n()/11) |> # divide by 11 time points
dplyr::ungroup()## # A tibble: 6 × 2
## segment n
## <chr> <dbl>
## 1 LAE1 511
## 2 LIY1 408
## 3 LUW1 299
## 4 RAE1 502
## 5 RIY1 481
## 6 RUW1 314
Your turn
Similarly in the static analysis, please explore the data a little to understand the data better.
You can start with checking the column names to see what variables
are available in the data set. Then, use dplyr::group_by(),
dplyr::summarise() and dplyr::ungroup()
functions to inspect the data.
Data visualisation
Scaling formant frequencies
Now, let’s visualise the dynamic data. The basic procedure is the same as in the static analysis; We first apply z-score normalisation to the formant frequencies to make sure that formant values are comparable across speakers.
df_dyn <- df_dyn |>
dplyr::group_by(speaker) |> # tell R to do the following iteration per speaker
dplyr::mutate(
f1z = as.numeric(scale(f1)), # scale f1 into z-score
f2z = as.numeric(scale(f2)), # scale f2 into z-score
f3z = as.numeric(scale(f3)) # scale f3 into z-score
) |>
dplyr::ungroup() # don't forget ungroupingDescriptive statistics
Let’s check the mean and SD for both raw and normalised formant values: just see F1 for now. Note that the mean z-scores do not seem to look zero, but this is because computers are not very good at dealing with very small numbers (e.g., decimals) and some fluctuations occur in computing the values.
# check mean and sd of raw/scaled F1 values for each speaker
df_dyn |>
dplyr::group_by(speaker) |>
dplyr::summarise(
f1_mean = mean(f1),
f1_sd = sd(f1),
f1z_mean = mean(f1z),
f1z_sd = sd(f1z)
) |>
dplyr::ungroup() ## # A tibble: 45 × 5
## speaker f1_mean f1_sd f1z_mean f1z_sd
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2d57ke 468. 140. -9.05e-17 1
## 2 2drb3c 575. 243. 2.39e-16 1
## 3 2zy9tf 459. 228. -2.41e-16 1
## 4 3bcpyh 438. 142. -5.72e-18 1.00
## 5 3hsubn 537. 177. 6.11e-17 1
## 6 3pzrts 458. 195. -1.44e-18 1.00
## 7 4ps8zx 467. 172. 2.19e-16 1
## 8 54i2ks 453. 192. 1.43e-16 1.00
## 9 5jzj2h 412. 133. 2.49e-16 1.00
## 10 5upwe3 444. 189. -6.51e-17 1.00
## # ℹ 35 more rows
Your turn
Write a code chunk to inspect the mean and sd of raw/scaled F2 and F3
values for each speaker and for speaker
group. Think how you group the data in
dplyr::group_by().
Visualisation
raw trajectories
Let’s visualise the formant trajectories here. In the dynamic analysis, it is almost always the case that we take the time dimension on the x-axis and the dependent variable on the y-axis. This allows us to see how e.g., F1 changes over time.
We also make sure about the variable grouping to
tell ggplot2 how to organise the data. This can be done via
group argument in the geom function. Each
formant trajectory should come from one audio file, which is stored in
the file column. We use this information for grouping.
# F1 - raw trajectories
df_dyn |>
ggplot(aes(x = time, y = f1z)) +
geom_point(aes(colour = language, group = file), width = 0.3, alpha = 0.4) +
geom_path(aes(colour = language, group = file), width = 0.3, alpha = 0.4) +
geom_hline(yintercept = 0, linetype = "dashed", linewidth = 0.5) +
scale_colour_manual(values = cbPalette) +
facet_grid(liquid ~ context) +
labs(x = "time", y = "F1 (z-normalised)", title = "time-varying change in F1 frequency") +
theme(strip.text.y = element_text(angle = 0))smooths
While I usually prefer just plotting raw trajectories because it is faithful to the nature of the data, I must admit that it is sometimes very difficult to see what’s going on there.
If you prefer, we could also just plot smooths to
highlight the nonlinear between-group difference. The code below adds
smoothed F1z trajectories to the raw data we just plotted (but I have
commented out the raw trajectories for now). Note the difference in
grouping; we used the file variable for the raw
trajectories, but for smooths we need to use the language
variable because we would like one smoothed trajectory for each L1
group.
# F1 - smooths
df_dyn |>
ggplot(aes(x = time, y = f1z)) +
# geom_point(aes(colour = language, group = file), width = 0.3, alpha = 0.1) +
# geom_path(aes(colour = language, group = file), width = 0.3, alpha = 0.1) +
geom_smooth(aes(colour = language, group = language), linewidth = 1.2, se = TRUE) +
geom_hline(yintercept = 0, linetype = "dashed", linewidth = 0.5) +
scale_colour_manual(values = cbPalette) +
facet_grid(liquid ~ context) +
labs(x = "time", y = "F1 (z-normalised)", title = "smoothed time-varying change in F1 frequency") +
theme(strip.text.y = element_text(angle = 0))## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
How do we model F2 trajectory?
I hope that you have enjoyed data visualisation! You’d know if you have tried plotting F2, but it seems that the F2 trajectories show very interesting between-group difference.
This is also a very interesting dimension to explore from the theoretical point of view, because previous research has claimed that L1 Japanese speakers would make it easier to acquire the use of F2 in a target-like manner. But the dynamic data suggests that L1 Japanese speakers do something really different from L1 English speakers. Let’s take a look at trajectories first below:
# F2
df_dyn |>
ggplot(aes(x = time, y = f2z)) +
geom_point(aes(colour = language, group = file), width = 0.3, alpha = 0.1) +
geom_path(aes(colour = language, group = file), width = 0.3, alpha = 0.1) +
geom_smooth(aes(group = language), colour = "white", linewidth = 2.8, se = FALSE) +
geom_smooth(aes(colour = language, group = language), linewidth = 1.8, se = TRUE) +
geom_hline(yintercept = 0, linetype = "dashed", linewidth = 0.5) +
scale_colour_manual(values = cbPalette) +
facet_grid(liquid ~ context) +
labs(x = "time", y = "F2 (z-normalised)", title = "raw/smoothed time-varying change in F2 frequency") +
theme(strip.text.y = element_text(angle = 0))## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
L1 English speakers follow somewhat consistent patterns across vowel contexts for both English /l/ and /ɹ/. They all start at a lower F2 at the beginning of the liquid onset (= 0%). The trajectories then go higher up to the maximal point in the middle of the interval (= around 50%). After reaching the highest peak, the trajectories go down towards the end of the interval (= 100%). Although there are some differences in terms of the timing that they achieve the maximal point, the overall patterns are fairly consistent.
L1 Japanese speakers, on the other hand, show distinct trajectory patterns across vowel contexts. In the /æ/ context, both English /l/ and /ɹ/ trajectories show an almost monotonic, linear decrease from the liquid onset (= 0%) towards the vowel offset (= 100%). The trajectories in the /u/ context are also similar. In these two vowel contexts, the English /ɹ/ trajectories seem to mark the highest point at around 25% time point, but it is not as pronounced as that of L1 English speakers.
Their /i/ trajectories, on the other hand, show a similar pattern to that of L1 English speakers. They start at a lower F2 value at the liquid onset (= 0%), go higher up, and then show a slight decrease towards the end of the interval. The timing of the maximal point, however, is quite early (i.e., at around 40-45% time point) compared to L1 English speakers.
Your turn
Please compare and discuss the static and dynamic plots for F2.
Note: The static analysis is based on the F2 measurement at the liquid midpoint. The dynamic analysis shows time-varying changes in F2 over liquid-vowel interval.
From linear to non-linear modelling: Introducting Generalised Additive Mixed Effects Models (GAMMs)
Modelling non-linearity in the data
Here, we would like to model the formant trajectory directory without relying just on a straight line. This can be illustrated by the plot below that shows the time-varying transition of F2 signals (in Hz) for the word reap produced by an L1 Japanese speaker.
As you can see in the plot below, the linear regression line (in yellow) does not capture the overall trend of the formant trajectory anymore. What we would ideally like is the non-linear curve in skyblue.
One solution to the non-linear modelling is via Generalised Additive (Mixed Effects) Models: GA(M)Ms. While GA(M)Ms are similar with the linear models in that they both can capture linear trends with parametric terms, GA(M)Ms also incorporate smooth terms that can capture non-linear relationships between the variables. Put it simply, GA(M)Ms is a combination (i.e., addition) of multiple (smooth) functions, which can be notated as:
\(y = \alpha + f(x) + \epsilon\)
where \(f(x)\) just means some function of \(x\) and \(\epsilon\) indicates an error term.
Basis functions
GA(M)Ms models the non-linearity in the data by combining multiple basis functions. The number of basis functions is related to the smoothness/wiggliness of a given GA(M)Ms model, such that, for example, there is more room for a GA(M)Ms trajectory to be smoother if the number of basis functions is high. Smoothness/wiggliness of a GA(M)Ms model is determined by the smoothing parameters, which GAM automatically estimate based on the data, and the number of basis function sets an upper limit as to how smooth a model can be.
What you can adjust in your model is the number of knots. A knot is a converging point between each of the basis functions, and the number of knots corresponds to the number of basis functions + 1. In the plot below, I show ten basis functions based on cubic regression splines, which means that there are eleven knots that I specify in the GAM model.
There are a few types of basis functions that can be specified in the GAMMs model. Common ones include thin plate regression splines (tp) and cubic regression splines (cr). They look quite different from each other, but resulting final smooth curve ends up very similar.
Parametric and smooth terms
Let’s now go back again to our data set and see how a GAMMs model can be specified on R. For example, the model below is a GAM model (without a random effect) that captures time-varying non-linearity along the F2 dimension:
bam(f2z ~ language + context + s(time) + s(time, by = language) + s(time, by = context), data = df_dyn_L)The model specification notation is somewhat similar to the one for
the lme4::lmer() convention for the linear models, so
hopefully this isn’t too complex for you.
As explained earlier, a GAM model can take two kinds of predictor
variables: parametric and smooth
terms. Parametric terms specify predictors that would
influence a constant, overall influence on the dependent variable. This
usually corresponds to the height of the trajectory. In
the model above, this corresponds to language and
context.
The new addition from the linear model would be smooth
terms which allow GA(M)Ms to fit non-linear effects as a
function of a variable. Smooth terms are notated as s().
This usually corresponds to the shape of the
trajectory. And in the model above, you can see a series of
s() terms, including s(time),
s(time, by = language) and
s(time, by = context). We will cover the specifics later,
but basically these instruct GAM to model the non-linearity seen
in F2 (dependint variable) as a function of time.
Let’s model F2 trajectory
Let’s try modelling a simple GAMMs model based on our data set. We
start with a simple GAM model to model the effects of
language and context on the F2 trajectory.
As usual, we’ll first subset the data and then convert the variables into factor, a similar procedure to the static analysis.
# subsetting data -- for /l/
df_dyn_L <- df_dyn |>
dplyr::filter(liquid == "L")
# language variable
df_dyn_L$language <- as.factor(df_dyn_L$language)
levels(df_dyn_L$language)## [1] "English" "Japanese"
## [1] "/æ/" "/i/" "/u/"
# speaker variable -- random effect
df_dyn_L$speaker <- as.factor(df_dyn_L$speaker)
levels(df_dyn_L$speaker)## [1] "2d57ke" "2drb3c" "2zy9tf" "3bcpyh" "3hsubn" "3pzrts" "4ps8zx" "54i2ks"
## [9] "5jzj2h" "5upwe3" "6p63jy" "7cd4t4" "9c4efu" "bfwizh" "birw55" "bj8mjm"
## [17] "byxcff" "c5y8z6" "cdsju7" "cu2jce" "dbtzn2" "dcxuft" "ds6umh" "f9japd"
## [25] "fgd95u" "fkcwjr" "h5s4x3" "heat7g" "hgrist" "i3wa7f" "jcy8xi" "kjn9n4"
## [33] "m46dhf" "m5r28t" "s6a8gh" "srky8g" "tay55n" "uig6n9" "ut4e5m" "we8z58"
## [41] "wrgwc3" "xub9bc" "z3n578" "zajk25" "zz3r2g"
## [1] "lamb" "lamp" "lap" "leaf" "leap" "leave" "loom" "lube"
A simple (linear) model only with parametric terms
What if we just model F2 only with parametric terms? Let’s just start with language variable.
# fit a model with parametric terms only
m1 <- bam(f2z ~ language, data = df_dyn_L, method = "fREML")
# model summary
summary(m1)##
## Family: gaussian
## Link function: identity
##
## Formula:
## f2z ~ language
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.04019 0.01355 2.965 0.00303 **
## languageJapanese -0.02048 0.01773 -1.155 0.24803
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## R-sq.(adj) = 2.5e-05 Deviance explained = 0.00996%
## -REML = 19166 Scale est. = 1.0225 n = 13398
The model output displays a section called Parametric
coefficients which shows how each of the paraemetric term has
an overall effect on the F2 values. But overall, the output looks pretty
much the same with the linear mixed-effect modelling using
lme4::lmer().
Adding non-linearity
Our first model m1 models the constant effect on the
language variable on the F2 values; that is, we could ask
whether L1 English and L1 Japanese speakers are overall different in F2
frequencies. But also, as we saw earlier in data visualisation, there
was a lot going on beyond the constant between-group difference.
Let’s extend the model so that we can model the non-linear
between-group difference. Here, we add a new smooth
term s(time) that allows us to model the non-linear
difference between L1 English and L1 Japanese speakers in the F2 values
over time.
# a model with a parametric and a smooth term
m2 <- bam(f2z ~ language + s(time, by = language), data = df_dyn_L, method = "fREML")
# model summary
summary(m2)##
## Family: gaussian
## Link function: identity
##
## Formula:
## f2z ~ language + s(time, by = language)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.03474 0.01243 2.796 0.00519 **
## languageJapanese -0.01566 0.01625 -0.963 0.33549
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(time):languageEnglish 6.700 7.834 284.86 <2e-16 ***
## s(time):languageJapanese 5.141 6.258 49.89 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.16 Deviance explained = 16%
## fREML = 18022 Scale est. = 0.8594 n = 13398
The top half of the model summary looks similar to the one for
m1. A new addition here is Approximate significance
of smooth terms; this is where we can evaluate the performance
of the smooth terms. There are two columns that you may not be familar
with, but edf would be the most important here.
edf stands for effective degree of freedom. This signifies how many parameters are needed to represent the smooth.
- Edf indexes the degree of non-linearity of the smooth. An edf value
closer to 1 means that the pattern modelled with
s()is linear.
- Edf indexes the degree of non-linearity of the smooth. An edf value
closer to 1 means that the pattern modelled with
Visualising GAM
Unlike linear models, interpretation of non-linear patterns is quite tricky and complicated. So, data visualiastion is crucial in the modelling of non-linear relationships. Let’s visualise the GAM model that we just modelled.
The package itsadug lets you visualise the F2 smooths
for each language group (L1 English and L1 Japanese speakers) as
follows:
## Summary:
## * language : factor; set to the value(s): English, Japanese.
## * time : numeric predictor; with 30 values ranging from 0.000000 to 100.000000.
## * NOTE : No random effects in the model to cancel.
##
It is also possilbe to visualise the difference in F2 trajectories between the two speaker groups:
## Summary:
## * time : numeric predictor; with 100 values ranging from 0.000000 to 100.000000.
## * NOTE : No random effects in the model to cancel.
##
##
## time window(s) of significant difference(s):
## 0.000000 - 35.353535
## 41.414141 - 100.000000
We start to see some interesting bits here. The first plot suggests that, overall, L1 Japanese speakers (in light blue) show a flat F2 trajectory compared to L1 English speakers (in light pink). The difference between the two trajectories lie almost throughout the time course; the two trajectories cross over each other at approximately 40% time point, where we find little difference, but otherwise the confidence interval for the difference smooth is not overlapping zero consistentely.